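Task: scrape Clash Royale clan information from royaleapi.com and store it in a dictionary.
The task breaks into two parts:

  • 1. Fetch the HTML file
  • 2. Parse the HTML data

Background knowledge needed:

  • Basic familiarity with HTML and CSS
  • The main functions of the re, urllib, and bs4 libraries
  • Regular expressions

I. Fetching the HTML file

Here is the code: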

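import urllib.error
import urllib.request


def askURL(url):
    # Browser-style User-Agent, so the request does not announce itself as Python-urllib.
    head1 = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"}
    reque = urllib.request.Request(url, headers=head1)
    html = ""
    try:
        # urlopen sends the request; read() returns bytes, decoded here as UTF-8.
        response = urllib.request.urlopen(reque)
        html = response.read().decode("utf-8")
    except urllib.error.URLError as e:
        # HTTPError (a subclass of URLError) carries a status code;
        # a plain URLError only carries a reason.
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    # If the request failed, html is still the empty string.
    return html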

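Line-by-line explanation:

  • 1. This function uses the urllib library.
  • 2. head1 is a dictionary holding the User-Agent header. (When a browser visits a site, it sends a string identifying itself. The string here was copied from my own browser; its contents identify the client as Chrome on Windows. If no User-Agent is set, urllib identifies itself as "Python-urllib/x.y", which amounts to telling the site "I am a crawler 🐶".)
  • 3. Inside the try block:
      • response holds the response object returned by opening the page
      • html holds the page body, decoded from UTF-8 bytes into a string
  • 4. The function returns the page content as a string (an empty string if the request fails).

The returned string looks like this: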
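The effect of the User-Agent header can be checked without sending any request. The following is a minimal sketch, not part of the original code: it reads the default header that a urllib opener would send and compares it with the header stored on a Request built the same way as in askURL. The variable names default_headers, browser_ua, and req are made up for this illustration.

```python
import urllib.request

url = "https://royaleapi.com/clan/YJ0PPR8J"  # clan page used later in this post

# Default headers that a urllib opener attaches when none are supplied.
default_headers = dict(urllib.request.build_opener().addheaders)
print(default_headers["User-agent"])  # starts with "Python-urllib/"

# Same User-Agent string as in askURL; nothing is sent until urlopen() is called.
browser_ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36")
req = urllib.request.Request(url, headers={"User-Agent": browser_ua})

# Request stores header names capitalized, hence "User-agent".
print(req.get_header("User-agent"))
```

Because no network call is made, this only shows what would be sent; whether a site accepts or rejects either value is up to the site.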

<!doctype html>
<html lang="en">
<head> <meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>任性小部落 #YJ0PPR8J | Clan - RoyaleAPI</title>

<link rel="apple-touch-icon" sizes="180x180" href="https://royaleapi.com/static/img/favicon/apple-touch-icon.png?t=0647bd87d25109655af8f3e1e88c40a0ce2192c9">
<link rel="icon" type="image/png" sizes="32x32" href="https://royaleapi.com/static/img/favicon/favicon-32x32.png?t=c98ef822a9cedab0fe9ee93478d40d899667c661">
<link rel="icon" type="image/png" sizes="16x16" href="https://royaleapi.com/static/img/favicon/favicon-16x16.png?t=21c28d2951994ad25eb03db2315d82e2132d2678">
<link rel="manifest" href="/static/img/favicon/manifest.json?t=34423a0929bf95545ca783822157e9a2b5dd8ba2">
<link rel="mask-icon" href="https://royaleapi.com/static/img/favicon/safari-pinned-tab.svg?t=f09d05bf709e0166cade8ea86d08e805d87c1960" color="#5bbad5">
<link rel="shortcut icon" href="https://royaleapi.com/favicon.ico">
<meta name="msapplication-config" content="https://royaleapi.com/static/img/favicon/browserconfig.xml?t=31082b60c007fac280802a7cea7386a75d20e5d4">
<meta name="theme-color" content="#ffffff">
<link rel="preconnect" href="https://cdn.royaleapi.com" crossorigin>
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://www.google-analytics.com">
<link rel="preconnect" href="https://cdnjs.cloudflare.com">
<link rel="preconnect" href="https://cdn.jsdelivr.net">

<meta name="google" content="notranslate" />

<meta name="title" content="任性小部落 #YJ0PPR8J | Clan - RoyaleAPI">
<meta name="description" content="The definitive source about decks, players and teams in Clash Royale. Explore advanced statistics about decks and cards based on millions of games per week.">
<meta name="keywords" content="Clash Royale, stats, analytics, decks, esports, API, strategy, guides, chests, Clash, Royale, RoyaleAPI, data, statistics, meta, best, cards, pro">
<meta property="og:title" content="任性小部落 #YJ0PPR8J | Clan - RoyaleAPI">
<meta property="og:type" content="website">
<meta property="og:site_name" content="RoyaleAPI">
<meta property="og:url" content="https://royaleapi.com/clan/YJ0PPR8J">

<link rel="stylesheet" type="text/css" href="https://cdn.royaleapi.com/static/semantic/dist/semantic.min.css?t=32a6bbf4a921163b72ef7bc18a2eff658b0fcb16" />
<link rel="stylesheet" type="text/css" href="https://cdn.royaleapi.com/static/scss/app.css?t=794ec8229089e2a68160fc83ea1d4f4554394872" />
<meta name="twitter:card" content="summary" />
<meta name="twitter:site" content="@RoyaleAPI">
<meta name="twitter:creator" content="@RoyaleAPI">
<meta name="twitter:title" content="任性小部落 #YJ0PPR8J | Clan - RoyaleAPI">
<meta name="twitter:description" content="The definitive source about decks, players and teams in Clash Royale. Explore advanced statistics about decks and cards based on millions of games per week.">
<meta property="og:image:width" content="127">
<meta property="og:image:height" content="151">
<meta property="og:image" content="https://cdn.royaleapi.com/static/img/badge/Cherry_Blossom_08.png?t=ede825cbfc9d9dec589c7284b72be82548937828">
<meta property="og:description" content="一起来玩耍呀 部落Q群683185709 捐卡500授长老 积极参加部落战 月亮不睡我不睡我是秃头小宝贝(⊙o⊙)!">

(Only a small part is shown above.)


II. Processing the fetched data

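import re

from bs4 import BeautifulSoup

# Patterns applied to str(item), the text of one serialized <tr> row.
# (.*) is greedy, so each capture still starts with the opening quote.
findPlayerTag = re.compile(r'data-tag=(.*)"')
findPlayerName = re.compile(r'data-name=(.*)"')
findDonate = re.compile(r'<td class="donations right aligned mobile-hide" data-sort-value=(.*)>')


def getdata(url):
    dic = {}
    html = askURL(url)  # askURL is defined in Part I
    soup = BeautifulSoup(html, "html.parser")
    # Every <tr> directly under <tbody> is one clan member.
    wholedata = soup.select('tbody > tr')
    for item in wholedata:
        s = str(item)
        tag = re.findall(findPlayerTag, s)[0]
        name = re.findall(findPlayerName, s)[0]
        donate = re.findall(findDonate, s)  # one match per donations cell
        info = []
        info += [cutName(str(name))]
        info += [str(donate)]
        # Drop the leading quote that was captured with the tag.
        dic[str(tag)[1:]] = info
    return dic


def cutName(raw):
    # raw starts with a quote; keep the text up to the next quote.
    end = raw.find('"', 2)
    return raw[1:end]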

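The information I want to extract: the player tag (stored in the variable tag), the player's in-game name (stored in name), and the player's donation counts (stored in donate).
Line-by-line explanation:
1. The bs4 and re libraries are needed.
2. soup = BeautifulSoup(html, "html.parser"): BeautifulSoup does the parsing and turns the page into a tree structure; html.parser selects the HTML parser from Python's standard library.

3. wholedata = soup.select('tbody > tr'): select does the searching, here in CSS-selector mode. Why 'tbody > tr'? To answer that, look at the structure of the HTML: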

<tbody>
<tr data-tag="28PVL00L2" data-role="Member" class="role-Member tr_member">
<td data-sort-value="1">1</td>
<td data-sort-value=神奇哟>
<a class="block member_link" data-tag="28PVL00L2" href="/player/28PVL00L2">
神奇哟
<div class="last_seen i18n_duration_short" data-seconds="" data-datetime="20200917T043248.000Z">
3h 17m ago
</div>
</a>
<div class="mobile-show">
<div class="meta">
Member
</div>
</div>
<div class="mobile-show mobile-member-summary">
<div>
<i class="big icons">
<img class="ui verytiny image" src="https://cdn.royaleapi.com/static/img/icon/ic-cards.png?t=73da5b1cae92477331cfa0c20a422c6cd6fa45f6" />
<i class="corner blue arrow up icon"></i>
</i>
58
</div>
<div>
<i class="big icons">
<img class="ui verytiny image" src="https://cdn.royaleapi.com/static/img/icon/ic-cards.png?t=73da5b1cae92477331cfa0c20a422c6cd6fa45f6" />
<i class="corner orange arrow down icon"></i>
</i>
160
</div>
</div>
<div class="join_status" id="join-28PVL00L2"></div>
</td>
<td class="inactivity_content inactivity in_trophies">
<img class="inactive_bar" style="display:none;" src="https://cdn.royaleapi.com/static/img/ui/inactivity-trophies.png?t=2a9ca5964e11b039314532607f6924022cabfc58" />
</td>
<td class="inactivity_content inactivity in_donations">
<img class="inactive_bar" style="display:none;" src="https://cdn.royaleapi.com/static/img/ui/inactivity-donations.png?t=873649431655834474ce80d06a2d7cdcc251ac53" />
</td>
<td class="mobile-hide tablet-hide">
<a href="/player/28PVL00L2/battles">
<img class="ui verytiny image" src="https://cdn.royaleapi.com/static/img/ui/battle.png?t=b176e1aa36db187823a4a93069cba312e838df9b" alt="Battles">
</a>
</td>
<td class="mobile-hide" data-sort-value="3">Member</td>
<td class="mobile-hide tablet-hide" data-sort-value="6325462410230">
28PVL00L2
</td>
<td class="" data-sort-value="5167">
5,167
<img class="ui mini image arena_icon" src="https://cdn.royaleapi.com/static/img/arenas-fs8/arena16-fs8.png?t=3b029e481bd4b71e7e2f5256452db7a032477a68" alt="Arena_L4 / Master I" />
</td>
<td class="right aligned mobile-hide" data-sort-value="12">12</td>
<td class="donations right aligned mobile-hide" data-sort-value="58">
58
<i class="big icons">
<img class="ui verytiny image" src="https://cdn.royaleapi.com/static/img/icon/ic-cards.png?t=73da5b1cae92477331cfa0c20a422c6cd6fa45f6" />
<i class="corner blue arrow up icon"></i>
</i>
</td>
<td class="donations right aligned mobile-hide" data-sort-value="160">
160
<i class="big icons">
<img class="ui verytiny image" src="https://cdn.royaleapi.com/static/img/icon/ic-cards.png?t=73da5b1cae92477331cfa0c20a422c6cd6fa45f6" />
<i class="corner orange arrow down icon"></i>
</i>
</td>
<td class="mobile-hide">
<div class="">
<input id="compare-checkbox-28PVL00L2" class="compareme" type="checkbox" data-tag="28PVL00L2" data-name="神奇哟" />
</div>
</td>
</tr>
<tr data-tag="29LV0UYYR" data-role="Co-Leader" class="role-Co-Leader tr_member">
<td data-sort-value="2">2</td>

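As you can see, all the information I want sits inside the <tr> tags directly under <tbody>, so the selector tbody > tr picks out exactly those rows.
4. re is the regular-expression library; the function mainly used here is findall. Its pattern arguments, such as findPlayerTag, are compiled once as global variables (for how to write such patterns, search for "regular expressions"). In the HTML, the tag is stored as data-tag="28PVL00L2" (28PVL00L2 being the tag), and findPlayerTag = re.compile(r'data-tag=(.*)"') finds fields of that form. Note that (.*) is greedy and the pattern contains no opening quote, so the captured text still begins with the " character; that is why getdata drops the first character with [1:] and why cutName trims the name.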
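To make the greedy behaviour concrete, here is a small sketch that runs the original patterns and a non-greedy variant on two sample lines. The sample lines are hand-written assumptions that imitate how a serialized row might look; the text produced by str(item) can order attributes differently from the raw page source. The names lazy_tag and lazy_name are hypothetical and do not appear in the original code.

```python
import re

# Hand-written sample lines (assumptions), imitating a serialized row.
tr_line = '<tr class="role-Member tr_member" data-role="Member" data-tag="28PVL00L2">'
input_line = '<input class="compareme" data-name="神奇哟" data-tag="28PVL00L2" type="checkbox"/>'

# Patterns from the post: greedy, and without the opening quote.
greedy_tag = re.compile(r'data-tag=(.*)"')
greedy_name = re.compile(r'data-name=(.*)"')

# Hypothetical variants: quotes outside the group, non-greedy match inside.
lazy_tag = re.compile(r'data-tag="(.*?)"')
lazy_name = re.compile(r'data-name="(.*?)"')

# Greedy: the capture runs up to the LAST quote on the line.
print(greedy_tag.findall(tr_line))      # leading quote is kept
print(greedy_name.findall(input_line))  # later attributes are swallowed too

# Non-greedy: the capture stops at the first closing quote.
print(lazy_tag.findall(tr_line))
print(lazy_name.findall(input_line))
```

With the non-greedy form, the [1:] slice and the cutName helper would no longer be needed, at the cost of changing the patterns the rest of this post relies on.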

III. Complete code:


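import re
import urllib.error
import urllib.request

from bs4 import BeautifulSoup

# Patterns applied to str(item), the text of one serialized <tr> row.
# (.*) is greedy, so each capture still starts with the opening quote.
findPlayerTag = re.compile(r'data-tag=(.*)"')
findPlayerName = re.compile(r'data-name=(.*)"')
findDonate = re.compile(r'<td class="donations right aligned mobile-hide" data-sort-value=(.*)>')


def askURL(url):
    # Browser-style User-Agent, so the request does not announce itself as Python-urllib.
    head1 = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"}
    reque = urllib.request.Request(url, headers=head1)
    html = ""
    try:
        # read() returns bytes, decoded here as UTF-8.
        response = urllib.request.urlopen(reque)
        html = response.read().decode("utf-8")
    except urllib.error.URLError as e:
        # HTTPError (a subclass of URLError) carries a status code;
        # a plain URLError only carries a reason.
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    return html


def getdata(url):
    dic = {}
    html = askURL(url)
    soup = BeautifulSoup(html, "html.parser")
    # Every <tr> directly under <tbody> is one clan member.
    wholedata = soup.select('tbody > tr')
    for item in wholedata:
        s = str(item)
        tag = re.findall(findPlayerTag, s)[0]
        name = re.findall(findPlayerName, s)[0]
        donate = re.findall(findDonate, s)  # one match per donations cell
        info = []
        info += [cutName(str(name))]
        info += [str(donate)]
        # Drop the leading quote that was captured with the tag.
        dic[str(tag)[1:]] = info
    return dic


def cutName(raw):
    # raw starts with a quote; keep the text up to the next quote.
    end = raw.find('"', 2)
    return raw[1:end]


def main():
    url = "https://royaleapi.com/clan/YJ0PPR8J"
    dic = getdata(url)
    # Print each player tag followed by [name, donations].
    for key, value in dic.items():
        print(key)
        print(value)
        print("\n\n")


if __name__ == "__main__":
    main()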

Output:

28PVL00L2
['神奇哟', '[\'"58"\', \'"160"\']']



29LV0UYYR
['宇宙无敌&amp;地表最强 超级麦麦!', '[\'"326"\', \'"240"\']']



JY289GJ
['宗大宝', '[\'"1013"\', \'"400"\']']



Q9UY98L
[':(*^﹏^*):', '[\'"259"\', \'"240"\']']



2QRRR2JGV
['吉吉思密达', '[\'"288"\', \'"256"\']']

(The output is long; only a small part is shown.)
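In the output above, the donation values are stored as a stringified list with the quotes still embedded, and one name contains the escaped entity &amp;. One possible alternative, sketched here under assumptions and not the method used in this post, is to read attribute values directly from the parsed tags instead of running regular expressions over str(item). The function parse_members and the trimmed sample_html are hypothetical; the sample keeps only the attributes visible in the row shown in Part II, and the live page may differ.

```python
from bs4 import BeautifulSoup

# Trimmed, hand-written sample imitating one row of the clan table (an assumption).
sample_html = """
<table><tbody>
<tr data-tag="28PVL00L2" data-role="Member" class="role-Member tr_member">
<td class="donations right aligned mobile-hide" data-sort-value="58">58</td>
<td class="donations right aligned mobile-hide" data-sort-value="160">160</td>
<td class="mobile-hide">
<input class="compareme" type="checkbox" data-tag="28PVL00L2" data-name="神奇哟" />
</td>
</tr>
</tbody></table>
"""


def parse_members(html):
    """Return {player_tag: [name, [donation values as ints]]}."""
    soup = BeautifulSoup(html, "html.parser")
    members = {}
    for row in soup.select("tbody > tr"):
        tag = row.get("data-tag")                         # attribute of the <tr> itself
        name_holder = row.select_one("input[data-name]")  # checkbox that carries the name
        if tag is None or name_holder is None:
            continue  # skip rows that do not look like member rows
        donations = [int(td["data-sort-value"]) for td in row.select("td.donations")]
        members[tag] = [name_holder["data-name"], donations]
    return members


print(parse_members(sample_html))
```

Attribute access returns the decoded value, so a name containing & would come back with a plain & rather than &amp;. The int() conversion assumes data-sort-value is always numeric, which holds for the sample but has not been checked against the live page.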